An amino acid sequence of a protein may be decomposed into consecutive overlapping strings 
of length K. How unique is the converse, i.e., the reconstruction of amino acid sequences from the set 
of K-strings obtained in the decomposition? This problem may be transformed into the problem 
of counting the number of Eulerian loops in an Euler graph, though the well-known formula must 
be modified. By exhaustive enumeration and by using the modified formula, we show that the 
reconstruction is unique at $K \ge 5$ for an overwhelming majority of the proteins in the PDB.SEQ database. 
The corresponding Euler graphs provide a means to study the structure of repeated segments in 
protein sequences. 

PACS numbers: 87.10.+e, 87.14.Ee 



I. INTRODUCTION 

The nucleotide composition of DNA sequences and the amino acid composition of protein sequences have been 
widely studied. For example, the G+C content and CpG islands in DNA have played an important role in gene-finding 
programs. However, this kind of study has usually been restricted to the frequency of single letters or short strings, 
e.g., dinucleotide correlations in DNA sequences [1] or amino acid frequencies in various complete genomes [2]. 
In contrast to DNA sequences, amino acid correlations in proteins have been much less studied. A simple reason 
might be that there are 20 amino acids and it is difficult to comprehend the 400 correlation functions even at the 
two-letter level. A more serious obstacle is that protein sequences are too short for taking averages in the 
usual definition of correlation functions. 

For short sequences like proteins one should naturally approach the problem from the other extreme by applying 
more deterministic, non-probabilistic methods. In fact, the presence of repeated segments in a protein is a strong 
manifestation of amino acid correlation. This problem has a nice connection to the number of Eulerian loops in Euler 
graphs. Therefore, we start with a brief detour to graph theory. 

II. NUMBER OF EULERIAN LOOPS IN AN EULER GRAPH 

Eulerian paths and Euler graphs comprise a well-developed chapter of graph theory, see, e.g., [3]. We collect a few 
definitions in order to fix our notation. Consider a connected, directed graph made of a certain number of labeled 
nodes. A node i may be connected to a node j by a directed arc. If from a starting node $v_0$ one may go through 
a collection of arcs to reach an ending node $v_f$ in such a way that each arc is passed once and only once, then the 
route is called an Eulerian path. If $v_0$ and $v_f$ coincide, the path becomes an Eulerian loop. A graph in which there 
exists an Eulerian loop is called an Eulerian graph (Euler graph for short). An Eulerian path may be made an Eulerian 
loop by drawing an auxiliary arc from $v_f$ back to $v_0$. We only consider Euler graphs defined by an Eulerian loop. 

From a node there may be $d_{out}$ arcs going out to other nodes; $d_{out}$ is called the outdegree (fan-out) of the node. 
There may be $d_{in}$ arcs coming into a node, $d_{in}$ being the indegree (fan-in) of the node. The condition for a graph to 
be Eulerian was indicated by Euler in 1736; for a directed graph it reads 

$$d_{in}(i) = d_{out}(i) \equiv d_i$$

for all nodes i, so that the total number $2 d_i$ of arc ends meeting at each node is even. 

Numbering the nodes in a certain way, we may put their indegrees as a diagonal matrix: 

$$M = \mathrm{diag}(d_1, d_2, \ldots, d_m). \qquad (1)$$

The connectivity of the nodes may be described by an adjacency matrix $A = \{a_{ij}\}$, where $a_{ij}$ is the number of arcs 
leading from node i to node j. 

From the M and A matrices one forms the Kirchhoff matrix: 

$$C = M - A. \qquad (2)$$

The Kirchhoff matrix has the peculiar property that its elements along any row or column sum to zero: $\sum_i c_{ij} = 0$, 
$\sum_j c_{ij} = 0$. Furthermore, for an $m \times m$ Kirchhoff matrix all $(m - 1) \times (m - 1)$ minors are equal, and we denote 
their common value by $\Delta$. 
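
As a small worked illustration (the graph is chosen here for concreteness and does not come from the protein data), take three nodes joined by one arc in each direction between every pair, so that every node has indegree and outdegree 2:

$$
M = \mathrm{diag}(2, 2, 2), \qquad
A = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}, \qquad
C = M - A = \begin{pmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{pmatrix}.
$$

Each row and each column of $C$ sums to zero. Deleting any one row together with the column of the same index leaves $\begin{vmatrix} 2 & -1 \\ -1 & 2 \end{vmatrix} = 3$. Deleting row 1 and column 2 leaves $\begin{vmatrix} -1 & -1 \\ -1 & 2 \end{vmatrix} = -3$, which is again 3 once the cofactor sign $(-1)^{1+2}$ is included. The common value of the minors for this graph is therefore 3.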

A graph is called simple if between any pair of nodes there are no parallel (repeated) arcs and at all nodes there 
are no rings, i.e., $a_{ij} = 0$ or $1$ $\forall i, j$ and $a_{ii} = 0$ $\forall i$. The number R of Eulerian loops in a simple Euler graph is given 
by 

The BEST Theorem [3] (BEST stands for N. G. de Bruijn, T. van Aardenne-Ehrenfest, C. A. B. Smith, and W. 
T. Tutte): 

$$R = \Delta \prod_i (d_i - 1)!. \qquad (3)$$



For general Euler graphs, however, there may be arcs going out of and coming into one and the same node (some 
$a_{ii} \ne 0$) as well as parallel arcs leading from node i to node j ($a_{ij} > 1$). It is enough to put an auxiliary node on each 
parallel arc and ring to make the graph simple. The derivation goes just as for simple graphs, and in the final result one 
recovers the original graph without auxiliary nodes but with $a_{ii} \ne 0$ and $a_{ij} > 1$ incorporated into the adjacency matrix A. 
However, in accordance with the unlabeled nature of the parallel arcs and rings, one must eliminate the redundancy 
in the counting result by dividing it by $a_{ij}!$. Thus the BEST formula is modified to 

$$R = \frac{\Delta \prod_i (d_i - 1)!}{\prod_{ij} a_{ij}!}. \qquad (4)$$

As 0! = 1! = 1, Eq. (4) reduces to Eq. (3) for simple graphs. 
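
The modified formula can be sketched in a few lines of Python. This is an illustration and not the authors' program: the function names `determinant` and `eulerian_loop_count` and both test graphs are made up here. The sketch assumes a connected graph with equal indegree and outdegree at every node, and at least one arc of multiplicity one to anchor the loop (the role the auxiliary arc plays for protein sequences), so that the division by the factorials is exact.

```python
from fractions import Fraction
from math import factorial


def determinant(rows):
    """Exact determinant of an integer matrix by elimination over the rationals."""
    a = [[Fraction(x) for x in row] for row in rows]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def eulerian_loop_count(adj):
    """Minor times prod (d_i - 1)!, divided by prod a_ij!; adj[i][j] = arcs from i to j."""
    m = len(adj)
    indeg = [sum(adj[i][j] for i in range(m)) for j in range(m)]
    outdeg = [sum(row) for row in adj]
    if indeg != outdeg:
        raise ValueError("indegree and outdegree differ at some node")
    # Kirchhoff matrix C = M - A with M the diagonal matrix of indegrees.
    kirchhoff = [[(indeg[i] if i == j else 0) - adj[i][j] for j in range(m)]
                 for i in range(m)]
    # Any principal (m-1) x (m-1) minor will do; drop the first row and column.
    result = determinant([row[1:] for row in kirchhoff[1:]])
    for d in indeg:
        result *= factorial(d - 1)
    for row in adj:
        for a_ij in row:
            result /= factorial(a_ij)
    return result


# Simple graph: one arc in each direction between every pair of three nodes.
print(eulerian_loop_count([[0, 1, 1], [1, 0, 1], [1, 1, 0]]))
# Graph with two parallel arcs from node 0 to node 1.
print(eulerian_loop_count([[0, 2, 0], [1, 0, 1], [1, 0, 0]]))
```

For the first graph all $a_{ij}$ are 0 or 1, so the denominator is 1 and the result is that of the unmodified formula. In the second graph the two parallel arcs contribute a factor 2! to the denominator.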



III. EULERIAN GRAPH FROM A PROTEIN SEQUENCE 

We first decompose a given protein sequence of length L into a set of $L - K + 1$ consecutive overlapping K-strings 
by using a window of width K, sliding one letter at a time. Combining repeated strings into one and recording their 
copy number, we get a collection $\{W_j^K, n_j\}_{j=1}^M$, where $M \le L - K + 1$ is the number of different K-strings. 
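
The decomposition step can be sketched as follows. The function name `decompose` and the toy string are illustrative choices made here, not part of the original work.

```python
from collections import Counter


def decompose(sequence, k):
    """Slide a window of width k one letter at a time and count each distinct k-string."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


counts = decompose("ABABA", 2)
print(counts)                 # distinct 2-strings with their copy numbers
print(sum(counts.values()))   # total number of windows, L - K + 1
```

The keys of the returned counter play the role of the distinct K-strings and the values that of their copy numbers.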

Now we formulate the inverse problem. Given the collection $\{W_j^K, n_j\}_{j=1}^M$ obtained from the decomposition of a 
given protein, reconstruct all possible amino acid sequences subject to the following requirements: 

1. Keep the starting K-string unchanged. This is because most protein sequences start with methionine (M); even 
the tRNA for this initiation M is different from that for elongation. This condition can easily be relaxed. 

2. Use each string $W_j^K$ $n_j$ times and only $n_j$ times until the given collection is used up. 

3. The reconstructed sequence must reach the original length L. 

Clearly, the inverse problem has at least one solution: the original protein sequence. It may have multiple solutions. 
However, for K big enough the solution must be unique, as evidenced by the extreme case $K = L - 1$. We are 
concerned with how unique the solution is for real proteins. Our guess is that for most proteins the solution is unique at 
$K \ge 5$. 
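
The three requirements above translate directly into a depth-first search. The following is a minimal sketch under stated assumptions, not the authors' enumeration program: the name `reconstruct`, the cap `limit`, and the toy sequence are introduced here for illustration, and the recursion depth grows with the sequence length, so very long sequences would need an iterative version.

```python
from collections import Counter


def reconstruct(sequence, k, limit=10000):
    """List sequences rebuilt from the k-strings of `sequence` (k >= 2), up to `limit`."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    alphabet = sorted(set(sequence))
    start = sequence[:k]
    counts[start] -= 1          # requirement 1: the starting k-string is fixed
    found = []

    def extend(current):
        if len(found) >= limit:
            return
        if len(current) == len(sequence):   # requirement 3: original length reached
            found.append(current)
            return
        suffix = current[-(k - 1):]
        for letter in alphabet:
            word = suffix + letter
            if counts[word] > 0:            # requirement 2: respect copy numbers
                counts[word] -= 1
                extend(current + letter)
                counts[word] += 1

    extend(start)
    return found


print(reconstruct("ABACADA", 2))
print(reconstruct("ABACADA", 3))
```

For the toy string the letter A is visited three times, so at K = 2 the two excursions through C and D can be taken in either order, while at K = 3 only the original string survives.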

In order to tell the number of reconstructed sequences we transform the original protein sequence into an Euler 
graph in the following way. Consider the two (K - 1)-substrings of a K-string as two nodes and draw a directed arc 
to connect them. Repeated occurrences of the same (K - 1)-string are treated as a single node with more than one 
incoming and outgoing arc. 

Take the SWISS-PROT entry ANPA_PSEAM as an example [4]. This antifreeze protein A/B precursor of winter 
flounder has a short sequence of 82 amino acids and some repeated segments related to alanine-rich helices. Its 
sequence reads: 



MALSLFTVGQ LIFLFWTMRI TEASPDPAAK AAPAAAAAPA AAAPDTASDA AAAAALTAAN 
AKAAAELTAA NAAAAAAATA RG 



Consider the case K = 5. The first 5-string MALSL gives rise to a transition from node MALS to node ALSL. Shifting 
by one letter, from the next 5-string ALSLF we get an arc from node ALSL to node LSLF, and so forth. 

Clearly, we get an Eulerian path all of whose nodes have equal indegree and outdegree except for the first and the 
last nodes. Then we draw an auxiliary arc from the last node TARG back to the first node MALS to get a closed 
Eulerian loop. 
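
The construction can be sketched for the example sequence as follows. The names `euler_graph` and `degrees` are illustrative, and the graph is kept as a plain counter of (tail, head) pairs rather than any particular graph library structure.

```python
from collections import Counter

SEQ = ("MALSLFTVGQ LIFLFWTMRI TEASPDPAAK AAPAAAAAPA AAAPDTASDA AAAAALTAAN "
       "AKAAAELTAA NAAAAAAATA RG").replace(" ", "")


def euler_graph(sequence, k):
    """Arc multiplicities between (k-1)-strings, including the auxiliary closing arc."""
    arcs = Counter()
    for i in range(len(sequence) - k + 1):
        word = sequence[i:i + k]
        arcs[(word[:-1], word[1:])] += 1    # one arc per k-string occurrence
    arcs[(sequence[-(k - 1):], sequence[:k - 1])] += 1   # last node back to first
    return arcs


def degrees(arcs):
    """Indegree and outdegree of every node, counting parallel arcs and rings."""
    indeg, outdeg = Counter(), Counter()
    for (tail, head), mult in arcs.items():
        outdeg[tail] += mult
        indeg[head] += mult
    return indeg, outdeg


arcs = euler_graph(SEQ, 5)
indeg, outdeg = degrees(arcs)
print(len(SEQ), sum(arcs.values()))   # sequence length, number of arcs
print(indeg == outdeg)                # balanced at every node once the loop is closed
```

With the auxiliary arc included, every node is balanced, which is the Euler condition stated in Section II.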



In order to get the number of Eulerian loops there is no need to generate a fully-fledged graph with all the 
distinct (K - 1)-strings treated as nodes. The number of nodes may be reduced by replacing a series of consecutive 
nodes with $d_{in} = d_{out} = 1$ by a single arc, keeping the topology of the graph unchanged. In other words, only those 
strings in $\{W_j^{K-1}, n_j\}$ with $n_j \ge 2$ are used in drawing the graph. In our example it reduces to a small Euler graph 
consisting of 9 nodes: 

{AKAA, 2; AAPA, 2; APAA, 2; PAAA, 2; AAAA, 10; AAAP, 2; LTAA, 2; TAAN, 2; AANA, 2}. 
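
The list of retained nodes can be checked by counting the (K - 1)-strings of the example sequence directly. The helper name `repeated_nodes` is an illustrative choice.

```python
from collections import Counter

SEQ = ("MALSLFTVGQ LIFLFWTMRI TEASPDPAAK AAPAAAAAPA AAAPDTASDA AAAAALTAAN "
       "AKAAAELTAA NAAAAAAATA RG").replace(" ", "")


def repeated_nodes(sequence, k):
    """(k-1)-strings occurring at least twice, with their copy numbers."""
    counts = Counter(sequence[i:i + k - 1] for i in range(len(sequence) - k + 2))
    return {word: n for word, n in counts.items() if n >= 2}


for word, n in repeated_nodes(SEQ, 5).items():
    print(word, n)
```

Only these strings can serve as branching points of the graph; every other 4-string occurs once and lies on a chain that is collapsed into a single arc.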

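The Kirchhoff matrix, with the nodes numbered in the order listed above, is: 

$$
C = \begin{pmatrix}
 2 & -1 &  0 &  0 &  0 &  0 & -1 &  0 &  0 \\
 0 &  2 & -2 &  0 &  0 &  0 &  0 &  0 &  0 \\
 0 &  0 &  2 & -2 &  0 &  0 &  0 &  0 &  0 \\
 0 &  0 &  0 &  2 & -2 &  0 &  0 &  0 &  0 \\
-1 &  0 &  0 &  0 &  4 & -2 & -1 &  0 &  0 \\
 0 & -1 &  0 &  0 & -1 &  2 &  0 &  0 &  0 \\
 0 &  0 &  0 &  0 &  0 &  0 &  2 & -2 &  0 \\
 0 &  0 &  0 &  0 &  0 &  0 &  0 &  2 & -2 \\
-1 &  0 &  0 &  0 & -1 &  0 &  0 &  0 &  2
\end{pmatrix}
$$

The minor $\Delta = 192$ and 

$$R = \frac{192 \cdot 9!}{6!\,(2!)^6} = 1512. \qquad (5)$$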
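
The numbers for this example can be cross-checked with a short script. The sketch below is not the authors' program: it works on the unreduced graph (every distinct 4-string is a node), which has the same minor because nodes with a single outgoing arc leave no freedom of choice, and it uses fraction-free Bareiss elimination for the determinant. The names `bareiss_det` and `minor_and_count` are illustrative.

```python
from collections import Counter
from math import factorial

SEQ = ("MALSLFTVGQ LIFLFWTMRI TEASPDPAAK AAPAAAAAPA AAAPDTASDA AAAAALTAAN "
       "AKAAAELTAA NAAAAAAATA RG").replace(" ", "")


def bareiss_det(matrix):
    """Exact determinant of a non-empty integer matrix (fraction-free elimination)."""
    a = [row[:] for row in matrix]
    n = len(a)
    sign, prev = 1, 1
    for k in range(n - 1):
        if a[k][k] == 0:
            swap = next((r for r in range(k + 1, n) if a[r][k] != 0), None)
            if swap is None:
                return 0
            a[k], a[swap] = a[swap], a[k]
            sign = -sign
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                a[i][j] = (a[i][j] * a[k][k] - a[i][k] * a[k][j]) // prev
        prev = a[k][k]
    return sign * a[n - 1][n - 1]


def minor_and_count(sequence, k):
    """Return (minor, R) from the modified formula for the closed graph of `sequence`."""
    arcs = Counter((sequence[i:i + k - 1], sequence[i + 1:i + k])
                   for i in range(len(sequence) - k + 1))
    arcs[(sequence[-(k - 1):], sequence[:k - 1])] += 1    # auxiliary arc
    nodes = sorted({node for arc in arcs for node in arc})
    index = {node: n for n, node in enumerate(nodes)}
    m = len(nodes)
    indeg = [0] * m
    kirchhoff = [[0] * m for _ in range(m)]
    for (tail, head), mult in arcs.items():
        indeg[index[head]] += mult
        kirchhoff[index[tail]][index[head]] -= mult       # minus a_ij
    for i in range(m):
        kirchhoff[i][i] += indeg[i]                       # plus d_i on the diagonal
    minor = bareiss_det([row[1:] for row in kirchhoff[1:]])
    numerator = minor
    for d in indeg:
        numerator *= factorial(d - 1)
    denominator = 1
    for mult in arcs.values():
        denominator *= factorial(mult)
    # Exact because the auxiliary arc has multiplicity one in this example.
    return minor, numerator // denominator


print(minor_and_count(SEQ, 5))
```

For the sequence above this sketch returns a minor of 192, in agreement with the value quoted in the text, and 1512 reconstructed sequences at K = 5.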



We write R(K) to denote the number of reconstructed sequences from a decomposition using K-strings. 

We note, however, that precautions must be taken with spurious repeated arcs caused by the reduction of the number of 
nodes. In calculating the $\prod_{ij} a_{ij}!$ in the denominator of Eq. (4) one must subtract the number of spurious repeated 
arcs from the corresponding matrix element of the adjacency matrix. This remark applies also to the auxiliary arc 
obtained by connecting the last node to the first. Fortunately, there are no such spurious arcs in the example above. 

We have written a program to exhaustively enumerate the reconstructed amino acid sequences from 
a given protein sequence and another program to implement Eq. (4). The two programs yield identical results 
whenever comparable; the enumeration program skips the sequence when the number of reconstructed sequences 
exceeds 10000. 



IV. RESULT OF DATABASE INSPECTION 



We used the two programs to inspect the 2820 proteins in the special selection PDB.SEQ [5]. The summary is given 
in Table I. As expected, most of the proteins lead to unique reconstruction even at K = 5. At K = 10 such proteins 
make up 99% of the total. 

TABLE I. Distribution of the 2820 proteins in PDB.SEQ by the number of reconstructed sequences at different K. Percentages 
in parentheses are given with respect to the total number 2820. 

K     Unique          2-10    11-100    101-1000    1001-10000    > 10000
5     2164 (76.7%)    404     90        45          21            93
6     2651 (94.0%)    77      29        10          4             49
7     2732 (96.9%)    32      16        3           2             44
8     2740 (97.1%)    23      10        3                         44
9     2763 (97.9%)    13      7         1                         36
10    2793 (99.0%)    11      7         2           1             6
11    2798 (99.2%)    12      2         1           1


The fact that most of the protein sequences have unique reconstruction is not surprising if we note that for a 
random amino acid sequence of the length of a typical protein one would expect R = 1 at K = 5, as it is very unlikely 
that its decomposition yields repeated pairs of K-strings among the $20^5 = 3\,200\,000$ possible strings. A more 
positive implication of this uniqueness is that one may take the collection $\{W_j^K\}_{j=1}^M$ as an equivalent representation 
of the original protein sequence. This may be used in inferring phylogenetic relations based on complete genomes 
when it is impossible to start with sequence alignment. We will report our on-going work along this line in a separate 
publication [6]. 
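
The random-sequence argument can be made quantitative with a rough estimate. The sketch below assumes independent, uniformly distributed residues, for which any two windows coincide with probability $20^{-K}$; the length of 300 residues is a hypothetical stand-in for a typical protein, and the function name is illustrative.

```python
from math import comb


def expected_repeated_pairs(length, k, alphabet_size=20):
    """Expected number of coinciding window pairs in a uniform random sequence."""
    windows = length - k + 1
    return comb(windows, 2) / alphabet_size ** k


print(expected_repeated_pairs(300, 5))
```

Under these assumptions the expected number of repeated pairs at K = 5 is of the order of one in a hundred, so repeats in real proteins at this string length reflect genuine correlations rather than chance.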

A more interesting result of the database screening is that there exists a small group of proteins which have 
an extremely large number of reconstructed sequences. The number R is not necessarily related to the length of the 
protein. As a rule, long protein sequences, say, with 2000 or more amino acids, tend to have larger R at K = 5 or 
so, but the number drops quickly. In fact, all 29 proteins in PDB.SEQ with more than 2115 amino acids have 
unique or a small number of reconstructed sequences. Some not very long proteins have many more reconstructions 
than the long ones. We show a few "mild" examples in Table II. 

TABLE II. A few examples of protein decomposition with comparatively large R at K = 5. AA is the number of amino 
acids in the protein. 

Protein    MCM1_YEAST    PLMN_HUMAN    CENB_HUMAN    CERU_HUMAN
AA         286           810           599           1065
R(5)       7441920       3024000       491166720     3507840
R(6)       39312         384           17421         512
R(7)       1620          192           90            21
R(8)       252           96            12            6
R(9)       16            5             4             1
R(10)      2             1             1
R(11)      1

The inspection is being extended to all available protein sequences in public databases. 



V. DISCUSSION 



In this paper we have given some precise constructions and numbers associated with real protein sequences. Their 
biological implications have yet to be explored. 

As mentioned in Section IV, we have been using the uniqueness of the reconstruction for most protein sequences to 
justify the compositional distance approach to inferring phylogenetic relations among procaryotes based on their complete 
genomes [6]. Most of the phylogenetic studies so far consider mutations at the sequence level. Sequences of more or 
less the same length are aligned and distances among species are derived from the alignments. However, mutations 
from a common ancestral sequence reflect only one way of evolution. There might be another way of protein evolution 
— short polypeptides may fuse to form longer proteins. Perhaps our approach may better capture the latter situation. 

The decomposition and reconstruction described in this paper provide a way to study polypeptide repeats and 
amino acid correlations. The reconstruction problem naturally singles out a small group of proteins that have a 
complicated structure of repeated segments. One may introduce further coarse-graining by reducing the cardinality 
of the amino acid alphabet according to their biological properties. This makes the approach closer to real proteins. 
Investigations along these lines are under way. 

We note that the Eulerian path problem has been invoked in the study of sequencing by hybridization, i.e., in the 
context of RNA or DNA sequences, see [7] and references therein. To the best of our knowledge the modification of 
the BEST formula to take into account parallel arcs and rings has not been discussed so far. 